1 Motivation

Excel reports are everywhere you look. The reports are generated by accountants, sales and engineering teams, statistical bureaus and municipalities to name a few. These reports are organized and formatted with the intention of making it easier to read and understand the report. However, there is no standard way of generating Excel report. Depending on the nature of the report producer and the workflow of a given organization, Excel reports vary and often contains several rows and columns of numbers which can be challenging to comprehend.

As discussed on my last blog how to transform Excel data to R, R is a software tool that can be employed to compliment and enhance Excel reports. R, with tidyvrse libraries, can help further optimize data analysis by adding capabilities such as pulling advanced statistical summaries, enabling reproducability and automation of reports, modeling and generating visual summaries that overcome some of Excel’s limitations.

2 Enhancing Excel analysis with R

On this blog I will go over three Excel reports to demonstrate how R (tidyverse libraries) is used to convert Excel report to a “tidy” format. “Tidying is structuring datasets to facilitate analysis” read here about tidy and generate a visual summaries that compliment the report, and I believe easier to understand. The main objective of this blog is to demonstrate Excel formatted data conversation to to tidy format using R’s tidyverse libraries.

I will be using publicly available Excel data from [Toronto Transport Commission Ridership], a typical [cash flow] Excel report and Excel report from 2017 [Australia statistics Bureau] survey regarding marriage. This three reports were used as case study in the excellent Spreadsheet Munging Strategies book by Duncan Garmonsway, which employs the tidyxl and unpivotr packages. Although this two packages are very good and helpful working with Excel, I will be using strictly tidyverse for all Excel munging on this blog.

2.1 General Workflow for each use case

For each example, we will identify the Excel report and label the areas we will need to consider for munging, and generate an R script that will render a tidy-data-table, that I will use to generate a visual graphic summarization plot. Of course, once data is in tidy format, it makes it easier to extract additional knowledge applying empirical and statistical methods, extending the analysis for additional insights. But for this blog, we will limit to visual summarization.

If you want to follow along, download all relevant unprocessed datahere

3 Toronto Public Transport Ridership Trend 1985-2017

The Toronto Transport Commission (TTC) generates statistical data of ridership which can be downloaded from here, periodically, for public consumption in Excel. When you open the Excel, you can see that, while the table is formatted for human readers, it isn’t in a tidy format. To turn the Excel into tidy format, a number of transformation will have to be implemented. With out further ado, lets jump right into the steps with the TTC Excel.

3.1 Toronto Transport Commission (TTC) Excel

I use the As you can see on Figure 3.1 there are ten items to consider before arriving to a tidy format. While the header and footers, blank cells, variable name liked ‘Adult & Bus’ in data space, NA values and subtotal make it easier to read, they are not needed in a tidy format. I have used a tidyverse script to mung the Excel data, that take into account each of ten items highlighted (follow the note on the script for details).

Toronto Transit Excel data annotated.

Figure 3.1: Toronto Transit Excel data annotated.

3.2 TTC Excel Transformation to Tidy table

The first step in starting the process is importing the Excel data into R system. I use the readxl library that imports ‘.xls’ and ‘.xlsx’ Excel files. The readxl library provides several useful features (technically referred to as ‘arguments’) that reduces the amount of tasks used to import and transform Excel data. For example, if there are multiple sheets on the spreadsheet, you can specify which sheet you want to work from with the sheet argument, if you want to skip the first n rows, you can use the skip argument, specify cell with cell_rows, columns with range arguments, and also be able to define the column names and type with built in arguments. I will not dive into details about every function in the readxl library, but it is worth reading and understanding the arguments from the help pages when importing Excel data.

In addition to the readxl library, I have used data carpentry libraries in tidyverse including stringr, dplyr and tidyr. You will notice in the script functions such as select, filter, mutate, fill and slice are used to clean, reshape, modify and render the tidy data.
( _ Note: read the notes following # to follow along the transformation._ )

# Import the Excel data and use the "N/A" marker for missing cell values and skip the first 4 rows and start with the 5th.
toronto_transit <- read_excel("data/toronto_transit.xlsx", na = "N/A", skip = 4)

# rename the first 2 columns, append fy on the rest of the columns
df <- 
toronto_transit %>%
        select_all(~str_replace(., "^", "fy_")) %>%
        select(type = 1, fare_media = 2, c(3:35))
       
# Tidy data
# the set of instuction will complete unfilled cells, remove the lines that contain "TOTAL", replace na values with 0,
# create a features that will relable values to make it easier to understand, select the features we need to generate the data,
# and generate the tidy data table that we will use for analysis.
df_tidy <-
df %>%
     fill(type) %>% 
     slice(1:50)  %>% 
     filter(!str_detect(fare_media, "TOTAL")) %>%
     mutate_at(vars(fy_2017:fy_1985), ~replace(., is.na(.), 0)) %>% 
     mutate(type_of = case_when(fare_media == "ADULT" ~ "adult",
                                fare_media == "SENIOR/STUDENT" ~ "senior_student",
                                fare_media == "CHILDREN" ~ "children",
                                fare_media == "BUS" ~ "the_bus",
                                fare_media == "RAIL" ~ "rail",
                                fare_media == "WEEKDAY" ~ "day_of_week"
                                )) %>%
     select(type, fare_media, type_of, fy_2017:fy_1985) %>% 
     fill(type_of) %>% 
     mutate(line_num = seq(1:n())) %>% 
     filter(line_num != 33) %>% # special case for the BUS 
     filter(!fare_media %in% c("ADULT","SENIOR/STUDENT", "CHILDREN", "RAIL")) %>% 
     select(-line_num)   # done its job - thank you!

# Let us now display the TTC tidy data in a table 
knitr::kable( df_tidy, 
              caption = 'The Troronto Ridership data in tidy format.',
              booktabs = TRUE
              ) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center") %>%
scroll_box(width = "800px", height = "500px")
Table 3.1: The Troronto Ridership data in tidy format.
type fare_media type_of fy_2017 fy_2016 fy_2015 * fy_2014 fy_2013 fy_2012 fy_2011 fy_2010 fy_2009 fy_2008 fy_2007 fy_2006 fy_2005 fy_2004 fy_2003 fy_2002 fy_2001 fy_2000 fy_1999 fy_1998 fy_1997 fy_1996 fy_1995 fy_1994 fy_1993 fy_1992 fy_1991 fy_1990 fy_1989 fy_1988 fy_1987 fy_1986 fy_1985
WHO TOKENS adult 76106 102073 110945 111157 112360 117962 124748 120366 114686 94210 69134 75340 82162 80859 80330 82102 83771 82218 83028 85303 86991 87857 87775 97877 104016 114064 111365 119538 114874 122180 127088 126217 128207
WHO TICKETS adult 0 0 0 0 0 0 0 1298 8807 34445 65398 68546 73151 72952 71485 74578 70930 66331 64109 66490 66177 67164 70369 62700 57710 53655 35788 38369 37401 39514 38944 42052 48793
WHO TWO-FARE adult 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4329 12525 13265 12894 11201 17927 22313 27025 18837 8976 7347 0
WHO PRESTO - SINGLE RIDE adult 67829 27397 13323 9862 8194 4399 1139 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
WHO PRESTO - SRVM TOKEN RIDE adult 1271 1157 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
WHO PRESTO - SRVM CASH RIDE adult 821 582 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
WHO PRESTO - MONTHLY PASS adult 1613 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
WHO REGULAR MONTHLY PASS adult 174549 194820 204509 214932 213982 205086 194928 203101 208172 203313 195001 171314 140594 125836 119681 116805 118176 112081 103447 98473 91521 86549 96803 96907 100607 109509 108148 116610 113506 119264 109151 101901 94970
WHO POST-SECONDARY PASS adult 52721 51861 48396 42855 38426 35019 32091 9200 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
WHO TWIN-GO PASS adult 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 762 1118 1105 1430 1649 1592 1652 1976 2111 2514 2924 3235 2758 0 0 0 0 0
WHO WEEKLY PASS adult 6653 7547 8843 9361 9557 10185 9893 9237 8738 7517 7126 5413 1296 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
WHO CASH adult 36045 41536 48873 49120 48623 46467 43795 43149 41445 39408 36317 38684 47521 55172 51328 51052 58400 61539 54835 49658 46209 32642 20930 20708 22131 23696 60034 67296 65665 66872 75308 66475 63986
WHO MONTHLY PASS senior_student 27324 27621 25092 23064 20509 19769 18590 17169 15331 14864 14506 12931 11068 9940 10586 11123 12397 11785 10124 9419 8647 9098 11028 10716 10508 10389 5850 5117 4570 4541 3855 3288 2519
WHO WEEKLY PASS senior_student 1011 959 672 515 540 624 702 814 874 780 686 372 93 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
WHO TICKETS senior_student 31195 32997 32595 33408 35472 37039 38299 38674 38615 39097 40181 40808 42746 41562 41844 44018 44012 43885 44263 46559 48306 52852 58515 56539 56625 57082 56559 61358 61837 65708 65927 62963 61193
WHO TWO-FARE senior_student 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 93 63 0 0 0
WHO PRESTO - SINGLE RIDE senior_student 5703 1421 438 12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
WHO PRESTO - SRVM CASH RIDE senior_student 253 210 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
WHO PRESTO - MONTHLY PASS senior_student 26 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
WHO CASH senior_student 12532 10440 12170 12037 8538 8164 7609 5856 5526 5253 4211 4581 6549 7602 6759 6440 7507 7921 7481 7197 7665 8114 5202 4772 4703 3725 6030 6324 6267 6359 7449 10852 13117
WHO FREE RIDES children 24856 21875 10939 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
WHO TICKETS children 0 0 1066 7097 7563 7929 8304 8287 8562 8782 8959 8879 8143 7573 7915 8869 9133 9401 9186 9640 9667 10129 11338 12192 12193 11569 11722 12417 12261 13194 12581 12074 10884
WHO PRESTO - FREE CHILD RIDE children 163 36 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
WHO CASH children 0 0 526 3705 2708 2589 2433 2539 2410 2253 1933 2168 3916 4514 4072 3938 3999 4203 4209 4344 4313 3674 2877 2717 2486 1860 2546 3071 2916 3175 3526 3844 4768
WHO DAY/VIST./OTHER children 6728 9130 8561 10033 11428 11929 10642 10605 10880 9961 9636 9194 7598 6191 5817 9685 4943 4837 5576 5236 4611 4234 4833 4764 4228 1853 2572 972 1312 641 612 649 630
WHO BLIND/WAR AMPS children 1086 1088 1086 1119 1109 1086 1060 1073 1074 1092 1094 1025 1026 1065 1113 1275 1223 1138 1131 1137 1057 1057 1059 1119 1122 1127 1093 1098 1139 1252 1593 1506 1272
WHO PREMIUM EXPRESS children 448 474 490 451 401 372 344 322 313 310 295 259 260 278 280 282 339 330 301 307 310 322 425 437 417 257 0 0 0 0 0 0 0
WHO POSTAL CARRIERS children 0 0 0 0 0 0 0 0 0 0 0 58 702 700 664 683 719 753 790 783 902 1107 1122 1085 1331 1340 1298 1993 1860 1875 1874 1844 1821
WHO GTA PASS children 4283 4855 5471 6087 5784 5388 5642 5667 5800 5415 5292 4972 4395 3855 3538 3927 3326 3031 2683 2494 1915 1650 1375 343 0 0 0 0 0 0 0 0 0
WHERE BUS the_bus 261113 252899 238943 245292 239968 234582 223269 219855 218545 215997 216341 206526 209891 201954 195817 207392 205762 203234 196785 195720 196056 188070 192497 191020 189060 192495 189150 202681 198078 208057 203400 188912 192690
WHERE SUBWAY rail 213012 221622 228129 219849 217250 216101 213280 199131 199321 196004 191338 181736 174046 169501 163736 163446 164214 158277 150531 147369 143751 142081 148349 151214 153020 158658 164714 179815 177302 178577 174963 166892 161914
WHERE S.R.T. rail 3177 2951 3352 4254 4661 4667 4766 4232 4300 4639 4700 4166 4440 4117 3612 4128 3757 3480 2835 2514 2544 2317 2633 2951 3465 2672 2854 3518 3567 3001 3361 3248 2099
WHERE TROLLEY COACH rail 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1805 1077 13814 16117 15403 15217 17828 17771 17355
WHERE STREETCAR rail 55914 60607 63581 65420 63315 58657 58904 54139 49067 50060 47390 52116 42843 42527 42247 40573 46260 45567 42442 43086 37532 39962 44673 43067 46135 49349 53635 57103 56376 58623 57332 64189 58102
WHEN WEEKDAY day_of_week 424155 424117 423808 423269 416297 406913 395578 379810 374908 374765 368917 357814 348600 340549 332551 339780 343611 334198 320592 317972 311422 305983 318839 319087 324529 334515 350714 379987 374024 385093 382917 369295 358947
WHEN WEEKEND/HOLIDAY day_of_week 109061 113962 110197 111546 108897 107094 104641 97547 96325 91935 90852 86730 82620 77550 72861 75759 76382 76360 72001 70717 68461 66447 69313 69165 68956 69736 73453 79247 76702 78382 73967 71717 73213

3.3 Visualizing the TTC data with ggplot

Once we have converted the Excel into a tidy data, we can generate all hosts of statistical summaries. On this case, I will modify the data further and generate a data visualization with ggplot2, another powerful data visualization package from tidyverse. The rendered graphics summarizes the 32 year TTC data so that we can easily see the visual summary side by side. I will limit the analysis to generating three plots (See Figure 3.2, each one of the figures summaries different parts of the data, all in one pane.
1. The Toronto public transport ridership trend over 32 years(’85-’17).
2. Bus and Rail trend over the same 32 years period.
3. And the trend or ridership by age group.

# Vizualizelabel_number_si
gg_ttc <-
df_tidy %>% 
        pivot_longer(cols = fy_2017:fy_1985) %>%
        group_by(type, type_of, name) %>%
        summarise(total = mean(value)) %>%
        ungroup() %>%
        mutate_all(~str_replace_all(., "^fy_", "")) %>% 
        mutate_each(list(as.numeric), 3:4) %>% 
        mutate(type_two = case_when(type == "WHEN" ~ "Number of Riders (when)",
                                   type == "WHERE" ~ "Mode of transport Bus or Rail(where)",
                                   type == "WHO" ~ "Adult/Seniors_students/Children (Who)"
                                  )) %>%
ggplot(aes(x = name, y = total, group = type_of)) +
        geom_line(aes(col = type_of), size = 2) +
        scale_y_continuous(labels = scales::number_si, breaks = seq(2000, 270000, by = 20000)) +
        scale_x_continuous(breaks = seq(1985, 2019, by = 5)) +
        facet_wrap(type_two ~., scales = "free_y", nrow = 1) + 
        theme_abiyu() +
        theme(legend.position = "right",
              legend.title=element_blank()) +
        labs(title = "1985-2017 Toronto Public transport Ridership Trend",
             #subtitle = "",
             caption = "Source: TTC | summary &Viz: @abiyugiday",
             x = "years",
             y = "total rider ship")

gg_ttc
TTC transit trend from 1985-2017

Figure 3.2: TTC transit trend from 1985-2017

4 Cash Flow Excel Transformation

The Cash Flow Excel sheet is an example of what a typical cash flow report might look like. Again, we are going to apply the same workflow as TTC from section 3. We will use readxl to import the data into R, identify the items we want to modify to turn it into tidy format, and generate a script that will address the identified items and finally generate a visual graphics that will summarize the data.

4.1 Cash Flow Excel

As you can see on Figure 4.1, I have identified 7 items that we will need to consider on the cash flow Excel data. This includes, removing the header and footer, remove the empty columns in between months, adding a feature for the major expense types (cash inflow, cash outflow and other cash out flow) and removing total amounts from the Excel.

Cash flow annotated.

Figure 4.1: Cash flow annotated.

4.2 Cash Flow Tidy

Again, we start by importing the Excel into R with the read_excel function. The import will skip the first three rows removing the header that we will not use. We then use the scoped variant functions of the tidyverse to clean the data, and use the select, mutate and filter functions to transform the data.
( Note: read the notes following # in the code to follow along the transformation. )

# Import Excel and skip the first 3 raws, which remove the header
cash_flow_1 <- read_excel("data/cash_flow_1.xlsx", skip = 3)

# Use the imported data, remove all empty columns, remove character such as barackets, colons and asterisk, create a new vriable and coppy values, choose the relvant variables, remove rows that contain only missing values (na) and remove columns that will not be used to create a tidy format (ex the Excel summary values that contain 'TOTAL)
df_cashFlow_tidy <- 
cash_flow_1 %>%
        select_if(~!all(is.na(.))) %>% 
        # inflow clean up
        mutate_all(~str_replace_all(., "\\(", "")) %>% 
        mutate_all(~str_replace_all(., "\\)", "")) %>% 
        mutate_all(~str_replace_all(., ":", "")) %>% 
        mutate_all(~str_replace_all(., " ", "_")) %>% 
        mutate_all(~str_replace_all(., "\\*", "")) %>% 
        #tidy step
        select(type_1 = 1, type_2 =2 , everything()) %>%
        mutate(type_of = case_when(type_1 == "Cash_Inflows_Income" ~ "Cash_Inflows_Income",
                                   type_1 == "Cash_Outflows_Expenses" ~ "Cash_Outflows_Expenses",
                                   type_1 == "Other_Cash_Out_Flows" ~ "Other_Cash_Out_Flows"
        )) %>%
        select(type_1, type_2, type_of, everything())  %>%
        fill(type_of) %>% 
        tidyr::drop_na(type_2) %>% 
        mutate_all(tolower) %>%
        filter(!stringr::str_detect(type_2, "total")) %>% 
        select(-type_1, -TOTALS) 

# We will now render and print out a formatted tidy data as seen in Table 4.1
knitr::kable( df_cashFlow_tidy, 
              caption = 'The Troronto Ridership in tidy format.',
              booktabs = TRUE
              ) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center") %>%
scroll_box(width = "800px", height = "500px")
Table 4.1: The Troronto Ridership in tidy format.
type_2 type_of April May June July Aug Sept Oct Nov Dec Jan Feb Mar
cash_collections cash_inflows_income 4419 1198 -869 -769 3875 -3059 -3164 -4995 2646 -3945 -1894 1690
credit_collections cash_inflows_income -1698 4660 2814 4607 -943 2582 963 1997 -3995 4792 -839 422
investment_income cash_inflows_income -4260 4307 -684 4888 -3672 2330 -4344 -2361 -924 4614 -2765 1750
other cash_inflows_income -3231 -4810 -3050 4442 -1081 1268 2456 -4041 -4616 4478 2514 -3159
advertising cash_outflows_expenses 3295 4587 3432 2983 -2939 -2683 -4333 4052 2874 -4376 -1353 -4956
bank_service_charges cash_outflows_expenses 136 -3635 -3901 -802 -4306 2050 -841 555 -2577 -743 3414 -3909
insurance cash_outflows_expenses -2360 -2477 1148 -3858 3518 4453 -4665 -2059 -2045 327 896 -2656
interest cash_outflows_expenses 4497 -93 -4514 2501 4345 -4023 1616 4465 -4967 -3901 1266 -4184
inventory_purchases cash_outflows_expenses -764 2915 -4563 -3212 -3851 1385 -2026 -3229 -4869 -3118 4539 -698
maintenance_&_repairs cash_outflows_expenses 985 -2879 -3116 -4854 3715 -3231 -3704 -3653 2 3204 -4880 -2242
operating_supplies cash_outflows_expenses 3530 1674 4467 -370 4533 -3254 -164 -1731 -4495 -394 -3415 1236
payroll cash_outflows_expenses -3249 1973 2359 1528 1963 -758 -2437 -868 2931 -3798 -1279 2765
payroll_expenses cash_outflows_expenses 2786 -3061 138 2356 -4213 -4661 3296 -1736 662 2401 -3032 4176
sales_commissions cash_outflows_expenses -2983 -4151 -1732 695 -545 1561 382 2175 3617 -1627 2027 -2025
professional_fees cash_outflows_expenses -4487 4047 783 -4517 1103 776 974 -1764 -3889 4451 -3238 -1509
communications_equip. cash_outflows_expenses 2804 2664 -3194 2986 378 999 -2573 -1119 1180 -1663 4444 1549
rent cash_outflows_expenses 3449 -2594 -3879 1591 -3892 3706 1439 -4092 2842 -4741 4013 4912
office_supplies cash_outflows_expenses 2335 1633 -232 2671 -3212 4593 2614 -2096 -847 -3336 1876 2579
permits_&_licenses cash_outflows_expenses 2129 -2288 4199 3191 -2516 534 -4839 -1453 1496 -1000 -4695 -3374
utilities_&_telephone cash_outflows_expenses -2947 -1485 4847 -2918 -887 316 -592 4877 -739 -4195 1619 4061
travel cash_outflows_expenses -2093 2709 -4053 1227 4932 -1445 -4039 -2004 -3570 -250 -150 316
taxes cash_outflows_expenses -859 175 -143 -10 846 185 -3101 3151 -2675 3488 -2132 3622
other cash_outflows_expenses 998 -2835 -988 -1206 2069 -627 2095 -4390 2580 -4456 -3033 4786
sales_training cash_outflows_expenses 1623 457 3136 1675 -2873 -3310 -2990 2721 957 3453 4799 -3701
capital_purchases other_cash_out_flows 1901 3811 -1270 2393 -3280 2252 -968 -1413 -1794 1632 -939 -2306
loan_principal other_cash_out_flows -2555 -4169 -4517 -4410 1893 -2379 3541 2526 3118 509 251 157
owner’s_draw other_cash_out_flows 1818 1566 -2228 2396 -3597 -1463 3439 4057 -1076 -4961 588 3896
other other_cash_out_flows -4413 -595 -2891 3857 -4781 4794 -897 -4500 -4005 4881 378 -425

4.3 Visualizing the Cash flow data with ggplot

We can generate different types of visualization for the cash flow data, but on this example I will create small multiple line chart for all the items for inflow income, outflow income and other outflow expenses. All together, there are 26 categories and we want to be able to see he ebbs-and-flows of each category on its own, and plot them in a single pan so that it is easier for side by side comparison.

# Again, I have modified the data further by selecting the months _FY Apr - Sept_, 
# and pivot the data to a long form to preare it for the plot
# The plot used the facet_wrap function to genrate the small multiples.

df_cashFlow_tidy %>%
        mutate_each(list(as.numeric), April:Mar) %>%
        pivot_longer(cols = April:Mar) %>%
ggplot(aes(x = name, y = value, group = type_2)) +
        geom_line() +
        facet_wrap(~ type_2, scales = "free_y") +
        scale_y_continuous(labels = scales::number_si)  +
        theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
        theme_abiyu() +
        labs(x ="", y = "")
12 Months of Cash flow trends by category.

Figure 4.2: 12 Months of Cash flow trends by category.

5 Australia Marriage Survey Excel

“The Australian Marriage Law Postal Survey was a national survey designed to gauge support for legalizing same-sex marriage in Australia. The survey was held via the postal service between 12 September and 7 November2017.” The result was released by Australia Bureau of Statistics in Excel. There are several tabs, but we are going to focus on table 1.

5.1 Australia Excel data import and transformation

The Excel released is again is made easier to read, but not in tidy format needed to extend the analysis. Following the workflow established for TTC and cash flow Excel, I have identified eight items (see Figure 5.1 that will need to be addressed to convert the data into tidy format. The items include removing the header and footer, removing empty columns and raws, handling merged cells (for yes/no) and response types (response clear/eligible participants).

Australia Marriage Survey Excel annotated.

Figure 5.1: Australia Marriage Survey Excel annotated.

5.2 Australia Marriage Survey Tidy

We will again import the data using the read_excel function from the readxl library. This time, we will deal with the header with slice function just to show different way of handling the data. The resulting tidy data will contain both the response clear participants and the eligible participants.
( Note: read the notes following # in the code to follow along the transformation )

# Import the Excel - notice the header is not skipped
ozmarriage_table1 <- read_excel("data/ozmarriage_table1.xlsx")

# tidy data: We use the `select` function to rename each of the variables with type of vote in the header.
# We then pivot the data to long format and removed empty rows
# Pivot to wider format again for easier display and made sure the number class is set to numeric.

oz_mrg_surv_2017 <- 
ozmarriage_table1 %>% 
        select( states = 1, 
               rep_clr_yes = 2, 
                rep_clr_no = 4, 
               rep_clr_tot = 6,
               elg_resp_nc = 11,
               elg_non_rsp = 13,
               elg_tot = 15
               ) %>%
        slice(4:15) %>% 
        pivot_longer(cols = rep_clr_yes:elg_tot) %>%
        drop_na() %>%
        pivot_wider(names_from = "name", values_from = "value") %>%
        mutate_each(list(as.numeric), rep_clr_yes:elg_tot) 

write.csv(oz_mrg_surv_2017, file = "data/oz_mrg_surv_2017.csv", row.names = FALSE)

# We will now render and print out a formatted tidy data as seen in Table 5.
knitr::kable( oz_mrg_surv_2017, 
              caption = 'The Troronto Ridership in tidy format.',
              booktabs = TRUE
              ) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center")
Table 5.1: The Troronto Ridership in tidy format.
states rep_clr_yes rep_clr_no rep_clr_tot elg_resp_nc elg_non_rsp elg_tot
New South Wales 2374362 1736838 4111200 11036 1065445 5187681
Victoria 2145629 1161098 3306727 11028 743634 4061389
Queensland 1487060 961015 2448075 7088 695710 3150873
South Australia 592528 356247 948775 2778 242027 1193580
Western Australia 801575 455924 1257499 3188 346333 1607020
Tasmania 191948 109655 301603 805 77020 379428
Northern Territory(b) 48686 31690 80376 229 57496 138101
Australian Capital Territory(c) 175459 61520 236979 534 50595 288108
Australia 7817247 4873987 12691234 36686 3278260 16006180

5.3 Visualizing the Australia’s 2017 marriage survey

Now that we have the tidy data, we can extract insight and generate all host of visualization. But to keep it light, I chose to focus on the yes or no response voters distribution, for each of the eight territories and the total survey results of Australia. Bar chart is good choice to display categorical data, and so we will generate small multiple for each territory and the whole Australia in a single pane with ggplot facet function. But before generating the small multiple, , the tidy data is transformed further to generate the desired bar plot.
( Note: read the inline comment in the code for explanation )

# transoform the data = select the variables that will be used, pivot the data to long format, count and add variable that contains the calculated  percentile for each small multiple yes and no vote, and use the perctile as lable on top of the bar chart.

gg_1 <- 
oz_mrg_surv_2017 %>%
        select(states, rep_clr_yes, rep_clr_no) %>%
        pivot_longer(cols = rep_clr_yes:rep_clr_no ) %>%
        group_by(states) %>% 
        mutate(prcnt = scales::percent(value/sum(value))) %>%
ggplot(aes(x = name, y = value, fill = name)) +
        geom_bar(stat="identity", position = "dodge",color="black") +
        facet_wrap(~ states, scales = "free_y") +
        scale_y_continuous(labels = scales::number_si)  +
        geom_text(aes(label = prcnt), 
                  position = position_dodge(0.9), 
                  vjust = 2,
                  colour = "white",
                  size = 4,
                  fontface = "bold") +
        scale_fill_manual(name = "", labels = c("Respn Clear No", "Respns Clear Yes"), values=c("palevioletred2", "royalblue2")) +
        theme_minimal() +
        labs(x ="", y ="",
             title = "Australians support Marriage law change",
             subtitle = "2017 Australia Marriage Law Survey",
             caption = "Data Source: Australian Buraue of Statstics") +
        theme_abiyu() 
gg_1 
2017 OZ Marriage Survey Result & Choropleth Map

Figure 5.2: 2017 OZ Marriage Survey Result & Choropleth Map

5.4 Visualize the Thematic Map

One of the capabilities R bring in to Excel is applying data to a geospatial map, allowing the generation of thematic map (aka Choropleth map). Thematic maps allow for scaled color of a map according to a numeric value. Here I will have merged Australia’s states geocode with a tidy survey data result values. Covering how to do that is out of the scope of this blog. R has a number of libraries that will provide plotting a map. simple feature is “standardized way to encode spatial vector data”, and one that is recommended with robust features to generate thematic maps. you can read more about it here. To keep all what we need with tidyverse, I am going to use the ggplot2 data frame to generate the thematic map.

I will start with the tidy survey data (df_oz_vote) we have created from the Excel earlier. Use the dplyr verbs select, filter, pivot and mutate to transform the data and merge it with the center of each state, and use ggplot to render the plot, see on Figure 5.3. The choropleth map of Australia is generated with darker color indicating higher percentage yes or no votes, and lighter color indicating lower percentage yes or no vote. The percentage text is labeled for each state, and you can easily compare each state’s yes/no vote side by side. Most Australians voted yes! As the ol’ adage goes “A picture is worth a thousand words!

# get the tidy survey data
df_oz_vote <- read_csv("data/df_oz_vote.csv")

# The following script, selects the yes and no votes for all states - save for total Australia,
# and clean the names of the states to match with the center data
df_yes_no <- 
oz_mrg_surv_2017 %>% 
        select(admin_1 = 1, prcnt_yes = 2, prcnt_no = 3) %>%
        pivot_longer(cols = prcnt_yes:prcnt_no ) %>% 
        filter(admin_1 != "Australia") %>% 
        mutate(prcnt = scales::percent(value/sum(value))) %>%
        mutate(admin_1 = tolower(admin_1)) %>%
        mutate(admin_1 = str_replace_all(admin_1, "\\(b\\)", "")) %>% 
        mutate(admin_1 = str_replace_all(admin_1, "\\(c\\)", ""))  

# A function to calculate the center of the terrotories from the goecode
mid_range <- function(x) mean(range(x, na.rm = TRUE))                 

# Use plyr library ddply function to generate center (close to center)
center <-  plyr::ddply(df_oz_vote, .(region), colwise(mid_range, .(long, lat)))  

# Merge the two data frames to generate a label data frame.
df_label <- center %>% 
               mutate(admin_1 = str_replace_all(region, "^state of ", "")) %>% 
               left_join(df_yes_no, by = "admin_1") %>% 
               select(-region) %>%
               arrange(desc(name))

# plot the vis no and a yes
ggplot() + 
        geom_polygon(data = df_oz_vote, aes(x = long, y = lat, group = group, fill = value), 
                     linetype = "solid", col = "gray56") + facet_wrap(~ name) +  
        geom_text(data = df_label, aes(x = long, y = lat, label = prcnt), size = 3, col = "black") +
        scale_fill_distiller(name = "percent %", palette = "YlOrRd", direction = 1) + 
        theme(legend.position = "bottom") +
        labs(title = " 2017 Australia Marriage Survey - The Yes have it",
             caption = "data_source: Australian Buraue of Statstics\nMap: ChoroplethrAdmin1 | by @abiyugiday") +
        theme_abiyu() 
knitr::include_graphics("fig/blog3_oz_choro_srvy_vote.png")
Australia Marriage Survey thematic map

Figure 5.3: Australia Marriage Survey thematic map

6 R extending the power of Excel

Excel remains one of the most useful and powerful application to organize and work with data. With explosion of data across industries, Excel reports are produced at a faster clip than ever. As demonstrated above, R is ideal complimentary software that will extend and enhance Excel reports to maximize value extraction. On this blog we covered three different types of Excel reports and discussed how to transform each to a tidy format with tidyverse libraries.

We used the tidy data sets to generate visual summaries that enable us to understand and share. But with further analysis, the tidy data can answer a lot of questions, which would have been difficult otherwise. The analysis scripts can be easy or complicated, depending on the Excel report type, but once created they can be used to automate any further report with similar makeup, because they are reproducible.

Hope you found this blog useful! If you have any questions or comments, feel free to drop me an e-mail at abiyu.giday@datarecode.com or DM me on twitter @abiyugiday. I am also available to discuss with you should you organization need training or have Excel report you want extended with R